Web-Prospector - An Automatic, Site-Wide Wrapper Induction Approach for Scientific Deep-Web Databases
Abstract
Wrapper induction techniques traditionally focus on learning wrappers from examples of a single class of Web pages, i.e. from Web pages that are all similar in structure and content. Traditional wrapper induction thereby targets Web pages generated from a database using the same generation template as observed in the example set. Applying such techniques to Web sites generated from biological databases, however, we found a need to wrap structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such scientific Web sites do not provide data alone; they also tend to provide schema information in the form of data labels, giving further cues for solving the Web site wrapping task. In this paper we present a novel approach to automatic information extraction from whole Web sites that addresses this challenge and takes advantage of the additional cues commonly available in scientific deep Web databases. The solution consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical deep Web sources and report our preliminary results, which are very promising.
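The abstract does not say how pages are grouped into classes, so the following is only an illustrative sketch of the general idea of grouping structurally similar pages, not the authors' algorithm. It assumes that a page's structure can be approximated by its set of root-to-element tag paths and that two pages belong to the same class when the Jaccard similarity of those sets reaches a threshold. The names `tag_paths`, `jaccard`, `group_pages` and the `threshold` value of 0.7 are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature of a page as a frozenset of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_pages(pages, threshold=0.7):
    """Greedy grouping (assumed strategy): a page joins the first class whose
    representative signature is similar enough, else it starts a new class."""
    classes = []  # list of (representative_signature, [page names])
    for name, html in pages.items():
        signature = tag_paths(html)
        for representative, members in classes:
            if jaccard(representative, signature) >= threshold:
                members.append(name)
                break
        else:
            classes.append((signature, [name]))
    return [members for _, members in classes]
```

Calling `group_pages` with a dictionary that maps page names to HTML strings returns a list of page-name lists, one per discovered class. A per-class wrapper could then be induced from the members of each list; that step is not sketched here.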
Similar resources
Site-Wide Wrapper Induction for Life Science Deep Web Databases
We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques focus on learning wrappers based on examples from one class of Web pages, i.e. from Web pages that are all similar in structure and content. Thereby, traditional wrapper induction targets the understanding of Web pages generated f...
Annotation for Query Result Records based on Domain-Specific Ontology
The World Wide Web is enriched with a large collection of data, scattered across deep web databases and web pages in unstructured or semi-structured formats. Recently evolving customer-friendly web applications need special data extraction mechanisms to draw the required data out of the deep web according to the end user's query and to populate the output page dynamically at the fastest rate. In...
Think before you Act! Minimising Action Execution in Wrappers
Web wrappers access databases hidden in the deep web by first interacting with web sites by, e.g., filling forms or clicking buttons, to extract the relevant data from the thus unearthed result pages. Though the (semi-)automatic induction and maintenance of such wrappers has been extensively studied, the efficient execution and optimization of wrappers has seen far less attention. We demonstrat...
DIADEM: Thousands of Websites to a Single Database
The web is overflowing with implicitly structured data, spread over hundreds of thousands of sites, hidden deep behind search forms, or siloed in marketplaces, only accessible as HTML. Automatic extraction of structured data at the scale of thousands of websites has long proven elusive, despite its central role in the “web of data”. Through an extensive evaluation spanning over 10000 web sites ...
Automatic Generation of Deep Web Wrappers based on Discovery of Repetition
A Deep Web wrapper is a program that extracts contents from search results. We propose a new automatic wrapper generation algorithm which discovers a repetitive pattern from search results. The repetitive pattern is expressed by token sequences which consist of HTML tags, plain texts and wild-cards. The algorithm applies a string matching with mismatches to unify the variation from the template...
Publication date: 2009